Abstract


Neural Network based approximations of the Value function make up
the core of leading Policy Based methods such as Trust Region Policy
Optimization (TRPO) and Proximal Policy Optimization (PPO). While
this adds significant value when dealing with very complex environments,
we note that in environments with sufficiently small state and action
spaces, a computationally expensive Neural Network architecture offers
marginal improvement over simpler Value approximation methods. We present
an implementation of Natural Actor Critic algorithms with actor updates
through Natural Policy Gradient methods. This paper proposes that
Natural Policy Gradient (NPG) methods, with Linear Function Approximation
as the paradigm for value approximation, may surpass the performance
and speed of Neural Network based models such as TRPO and PPO within
these environments. On the Reinforcement Learning benchmarks CartPole
and Acrobot, we observe that our algorithm trains much faster than
complex neural network architectures and obtains an equivalent or greater
result. This allows us to recommend the use of NPG methods with Linear
Function Approximation over TRPO and PPO for both traditional and
sparse reward low dimensional problems.


1 Introduction 


Reinforcement Learning (RL) is a paradigm where an agent seeks to
maximize the reward it gains by refining its policy. At each timestep t, our
agent observes the environmental state, and according to policy $\pi$ it takes
some action. This action changes the environmental state and returns some
reward, which is used to retrain the policy (Figure 1). This simple procedure
has applications in numerous fields, such as developing self-driving cars [1]
or training a robot to solve a Rubik's Cube [2].


[Figure 1 diagram: the agent finds action $A_t$ according to policy
$\pi_\theta(s, a)$ and executes it; the environment returns state $S_{t+1}$
and reward $R_{t+1}$.]

Figure 1: Simplified RL procedure


The objective of our algorithm is to maximize the cumulative reward the
agent obtains over a number of timesteps t. As such, we hope to determine the
policy that results in the maximum expected cumulative reward. There exist
2 main branches of RL to find this policy: Value Based Methods
(such as Q-Learning) and Policy Based Methods (such as Actor Critic). We
will focus on Policy Based methods due to their more stable performance and
predictable convergence. One of the most well researched Policy Based methods
is Actor Critic, which features an actor (which updates the policy through
gradients) and a critic (which estimates the value function) [3, 4, 5]. These
elements work in tandem to form a system that guarantees convergence for
linear and non-linear (neural network) approximations of the value function,
along with reducing variance compared to standard Policy Based methods.
Actor critic methods are used in many fields, such as Robotics [6] and
Network control [7]. There is also a variation of actor critic called Natural
Actor Critic. Instead of using standard gradient descent to update the policy
during the actor step, Natural Actor Critic utilizes natural gradient descent.
This typically results in much quicker convergence to the optimal policy.

Currently, most implementations of Policy Based methods utilize complex
Neural Networks to estimate the value function. This is what lies at the core
of industry standard algorithms such as Trust Region Policy Optimization
(TRPO) [8] and Proximal Policy Optimization (PPO) [9]. Although these Neural
Network based methods are widely used for the vast majority of RL
applications, they still retain the drawbacks of Neural Network type
architectures. The most notable of these are the computationally expensive
nature of Neural Networks and the difficulty of development. In addition, we
theorize that there are a multitude of scenarios where the nature of the
problem does not necessitate Neural Networks. With this in mind, we aim to use
Linear Function Approximation (LFA) as a simpler paradigm to implement RL
algorithms. We aim to accomplish this through the use of a Natural Policy
Gradient (NPG) algorithm that utilizes LFA (LFA-NPG). We evaluate these
methods with regard to three aspects:


• Performance (Reward vs. number of policy iterations): In our testing, we
observe that LFA-NPG matches the performance of PPO and TRPO in
standard applications, and outperforms these algorithms in sparse reward
environments.

• Speed (Time vs. Reward): We observe that LFA-NPG is noticeably faster
than TRPO and PPO in standard applications, and in sparse reward
environments it significantly outperforms TRPO, while matching the
performance of PPO.

• Robustness (Performance over levels of noise): We note that in some
environments, LFA-NPG exhibits a higher level of resistance against
adversarial noise than TRPO and PPO.


For the remainder of this paper, we will conduct an analysis of leading Policy 
based methods, TRPO and PPO. We then outline the structure and design of 
our LFA-NPG algorithm, and proceed to compare it with TRPO and PPO. We 
then summarize our findings and draw conclusions. 


2 Literature Review 


We begin our analysis with an assessment of modern RL. Policy Gradient
Methods are a class of RL methods that optimize a parametrized policy through
gradient descent. We have decided to focus on Policy Gradient Methods for a
few reasons:

• Policy Gradient Methods may utilize the same optimization techniques for
states with adversarial noise ($\zeta$)

• The same methods may be used to describe discrete and continuous action
spaces

• Policy Gradient Methods can utilize knowledge of the problem to minimize
the parameters required for learning

• Policy Gradient Methods may function both with and without a model


Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization
(PPO) are the two most researched and implemented Policy Gradient Methods.
We will now establish the principles behind each algorithm, to develop an
intuitive understanding of the strengths and drawbacks of each.

Trust Region Policy Optimization (TRPO) [8]: TRPO updates each policy by
taking the largest step possible within some constraint. This constraint is
expressed in terms of the Kullback-Leibler divergence [10], which is analogous
to a measure of the distance between probability distributions. To accomplish
this large step, TRPO employs complex second order methods to ensure optimal
performance. This is the notable distinction between TRPO and normal
policy gradient algorithms, which keep policies relatively close to each other
in parameter space. Often, we observe that small differences in parameter
space may have a significant effect on performance, which requires us to avoid
large step sizes with normal policy gradient methods. However, TRPO avoids
these pitfalls, which enables it to quickly improve its performance.
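
To make the constraint concrete, the following minimal sketch computes the KL divergence between two discrete action distributions and checks it against a trust-region radius. The function names and the radius `delta=0.01` are illustrative assumptions, not values taken from this paper or from any TRPO implementation.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Clip before taking logs so that zero probabilities do not produce -inf.
    p_safe = np.clip(p, eps, 1.0)
    q_safe = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p_safe) - np.log(q_safe))))


def within_trust_region(old_probs, new_probs, delta=0.01):
    """True if the new policy stays within KL radius delta of the old one.

    delta is a hypothetical threshold chosen only for illustration.
    """
    return kl_divergence(old_probs, new_probs) <= delta


old = [0.5, 0.5]
print(within_trust_region(old, [0.52, 0.48]))  # small policy change
print(within_trust_region(old, [0.9, 0.1]))    # large policy change
```

The KL divergence is zero only when the two distributions coincide and grows as they move apart, which is why it can serve as a distance-like bound on each policy update.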

Proximal Policy Optimization (PPO) [9]: PPO is a policy gradient method
whose objective may likewise be described as taking the largest possible
policy step within some limit, similar to TRPO. PPO aims to utilize a set of
first-order methods to attain the same results as TRPO within a simpler
framework. It differs from TRPO in that it has no explicit constraint, but
rather relies on clipping the surrogate objective function to remove the
incentive for excessively large policy steps.
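
The clipping idea can be sketched in a few lines. The example below follows the commonly stated form of the clipped surrogate, the mean of min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio between new and old policies and A is an advantage estimate. The function name and the default `clip_eps=0.2` are assumptions made for illustration.

```python
import numpy as np


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of samples.

    ratio:     pi_new(a|s) / pi_old(a|s) for each sample
    advantage: advantage estimate for each sample
    clip_eps:  clipping range (an assumed illustrative value)
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the elementwise minimum removes any gain from moving the
    # ratio outside [1 - clip_eps, 1 + clip_eps].
    return float(np.mean(np.minimum(unclipped, clipped)))


print(ppo_clipped_objective([1.5], [1.0]))   # gain is capped at 1.2
print(ppo_clipped_objective([0.5], [-1.0]))  # penalty is floored at -0.8
```

Because the objective stops improving once the ratio leaves the clipping range, plain first-order optimizers have no incentive to push the policy further in a single update.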


3 Purpose 


Our objective comprises three questions:


1. Can Linear Function Approximation based Natural Policy Gradient
methods match the performance of state of the art neural network based
algorithms such as TRPO and PPO?


2. Can LFA reach optimal rewards in less time than leading algorithms such 
as TRPO and PPO? 


3. How do NPG algorithms compare with TRPO and PPO with regards to 
noise resistance? 


We hope to demonstrate that Natural Policy Gradient Algorithms can match 
performance and reach rewards in less time than the leading RL algorithms, 
TRPO and PPO. If we can successfully demonstrate this, then we provide vali- 
dation for the use of LFA architectures over Neural Networks in many use cases. 


4 Methodology 


To develop an algorithm that can train an intelligent agent, we
require a framework in which to model the problem. To accomplish this, we
turn to the Markov Decision Process (MDP), a mathematically idealized form
of the challenge about which we can make precise theoretical statements [11].
We define the finite MDP as a 5-tuple $(S, A, R, P, \gamma)$. At each timestep
$k$, the agent takes some action $A_k \in A(s)$ depending on the environment
state $S_k \in S$, receives some reward $R(S_k, A_k) \in \mathbb{R}$, and
observes some new environmental state $S_{k+1}$ according to transition
probability $P$. This results in a sequence of states, actions, and rewards.

As per the definition of a finite MDP, we may take $S$, $A$, and $R$ as
discrete and finite sets. This allows us to conclude that for each $s' \in S$
and $r \in R$, we can determine the probability of those values occurring at
timestep $k$, given the preceding state and action:

$$p(s', r \mid s, a) = \Pr\{S_k = s',\ R_k = r \mid S_{k-1} = s,\ A_{k-1} = a\}$$

for all $s, s' \in S$, $a \in A(s)$, and $r \in R$. The dynamics of our MDP
system can then be described as follows:

At timestep $k$, observe environmental state $S_k \in S$. Determine
$A_k \in A$ according to policy $\pi$, $A_k \sim \pi(\cdot \mid S_k)$. The
system then reaches some new state based on the transition probabilities
$P(S_{k+1} = \cdot \mid S_k, A_k)$ and returns reward $R(S_k, A_k)$. We seek
to maximize our cumulative reward, which we define through our value function.
This may be expressed as

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R(S_k, A_k) \,\middle|\, S_0 = s,\ A_k \sim \pi(\cdot \mid S_k)\right]$$

For an initial distribution $\rho$ over all states, we also define the value
function $V^{\pi}(\rho) = \mathbb{E}_{s \sim \rho}[V^{\pi}(s)]$. Given this, we
define the objective of our algorithm as finding the optimal policy $\pi^*$,
such that

$$\pi^* \in \arg\max_{\pi \in \Pi} V^{\pi}(\rho)$$
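
As a small illustration of the sum inside the value function, the sketch below computes the discounted return of one finite reward sequence. Averaging this quantity over many episodes started from the same state would give a Monte Carlo estimate of the value of that state. The function name `discounted_return` and the truncation to a finite episode are assumptions of this example, not part of the paper's implementation.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over one finite episode."""
    total = 0.0
    # Accumulate from the last reward backwards: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```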

where $\Pi$ is the set of all policies. Having defined our objective, we
proceed to determine a system to maximize the reward. Parametrizing the policy
with a parameter $\theta$, we aim to find some $\pi_\theta$ that maximizes
$V^{\pi_\theta}(\rho)$. We may then redefine our objective as finding
$\theta^*$, such that $\theta^* \in \arg\max_\theta V^{\pi_\theta}(\rho)$. As
per our definition of policy, we know $\sum_a \pi(a \mid s) = 1$ for all
$s \in S$. In order to parametrize our policy $\pi$, we introduce a set of
function approximators $\phi_{s,a} \in \mathbb{R}^d$. We may then define our
policy in terms of $\phi$ and $\theta$ as

$$\pi_\theta(a \mid s) = \frac{\exp(\phi_{s,a}^{\top} \theta)}{\sum_{a'} \exp(\phi_{s,a'}^{\top} \theta)}$$
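
The log-linear policy above is a softmax over the scores $\phi_{s,a}^{\top}\theta$. A minimal NumPy sketch follows. The function name `log_linear_policy`, the array layout (one row of features per action), and the max-subtraction for numerical stability are choices made for this illustration.

```python
import numpy as np


def log_linear_policy(phi, theta):
    """Action probabilities pi_theta(. | s) for one state.

    phi:   array of shape (num_actions, d); row a holds phi_{s,a}
    theta: parameter vector of shape (d,)
    """
    logits = phi @ theta
    # Subtracting the maximum leaves the softmax unchanged but avoids overflow.
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()


phi = np.array([[1.0, 0.0],
                [0.0, 1.0]])
print(log_linear_policy(phi, np.zeros(2)))           # uniform when theta = 0
print(log_linear_policy(phi, np.array([2.0, 0.0])))  # favors the first action
```

With $\theta = 0$ every action receives the same score, so the policy starts out uniform, which matches initializing $\theta$ at zero.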


Algorithm 1 Sampler for $s, a \sim d^{\pi}_{\nu}$ and unbiased estimate of $Q^{\pi}(s, a)$


Require: Starting state-action distribution $\nu$
1: Sample $s_0, a_0 \sim \nu$
2: Sample $s, a \sim d^{\pi}_{\nu}$, such that at each timestep $h$, with
probability $\gamma$, select actions according to $\pi$, else accept
$(s_h, a_h)$ as the sample and progress to Step 3.
3: From $s_h, a_h$, continue to act according to $\pi$ with termination
probability of $1 - \gamma$. After termination, set $\hat{Q}^{\pi}(s_h, a_h)$
as the undiscounted sum of rewards from time $h$ on.
4: return $(s_h, a_h)$ and $\hat{Q}^{\pi}(s_h, a_h)$
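
The sampler can be sketched as follows. This is an illustrative reading of Algorithm 1, not the authors' code. The environment interface (`reset()` returning a state, `step(action)` returning `(state, reward, done)`), the `policy(state, rng)` callable, the `ToyChain` environment, and the `max_steps` safety cap are all hypothetical. The start distribution $\nu$ is taken to be whatever `reset()` plus one policy draw produces.

```python
import numpy as np


class ToyChain:
    """Hypothetical two-state environment, used only to exercise the sampler."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves to state 1 with reward 1; action 0 stays at 0 with reward 0.
        self.state = 1 if action == 1 else 0
        return self.state, float(action), False


def sample_state_action_and_q(env, policy, gamma, rng, max_steps=10_000):
    """Return (s_h, a_h, q_hat) following the structure of Algorithm 1."""
    # Step 1: draw the starting state-action pair.
    state = env.reset()
    action = policy(state, rng)
    done = False

    # Step 2: keep acting with probability gamma, otherwise accept (s_h, a_h).
    for _ in range(max_steps):
        if rng.random() >= gamma:
            break
        state, _, done = env.step(action)
        if done:
            break  # the episode ended before a sample was accepted
        action = policy(state, rng)
    s_h, a_h = state, action

    # Step 3: roll out from (s_h, a_h), stopping with probability 1 - gamma
    # after each step, and add up the rewards without discounting.
    q_hat = 0.0
    for _ in range(max_steps):
        if done:
            break
        state, reward, done = env.step(action)
        q_hat += reward
        if rng.random() >= gamma:
            break
        action = policy(state, rng)

    # Step 4: return the sample and its Q estimate.
    return s_h, a_h, q_hat


rng = np.random.default_rng(0)
always_one = lambda state, rng: 1  # hypothetical fixed policy
print(sample_state_action_and_q(ToyChain(), always_one, 0.9, rng))
```

Stopping with probability $1 - \gamma$ after each step means the reward at offset $k$ is collected with probability $\gamma^k$, so the undiscounted sum has the discounted return as its expectation.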


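As such, given $\phi$, we aim to find the optimal vector $\theta$. We plan to
optimize $\theta$ via gradient ascent, with the update
$\theta_{k+1} = \theta_k + \eta \nabla_\theta V^{\pi_{\theta_k}}(\rho)$,
$\theta \in \mathbb{R}^d$, to find the optimal $\theta^*$. Utilizing the
policy gradient theorem, we then find

$$\nabla_\theta V^{\pi_\theta}(\rho) = \mathbb{E}\left[Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$

$$Q^{\pi_\theta}(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R(S_k, A_k) \,\middle|\, S_0 = s,\ A_0 = a,\ A_k \sim \pi_\theta(\cdot \mid S_k)\right]$$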
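
For the log-linear policy, the score function in the policy gradient has a simple closed form. The short derivation below is a standard property of softmax policies, stated here as background rather than taken from the paper. It shows that the score is the feature vector of the chosen action minus the policy-averaged feature vector.

```latex
\begin{aligned}
\log \pi_\theta(a \mid s)
  &= \phi_{s,a}^{\top}\theta - \log \sum_{a'} \exp\!\left(\phi_{s,a'}^{\top}\theta\right) \\
\nabla_\theta \log \pi_\theta(a \mid s)
  &= \phi_{s,a} - \frac{\sum_{a'} \exp\!\left(\phi_{s,a'}^{\top}\theta\right)\phi_{s,a'}}
                        {\sum_{a''} \exp\!\left(\phi_{s,a''}^{\top}\theta\right)} \\
  &= \phi_{s,a} - \sum_{a'} \pi_\theta(a' \mid s)\,\phi_{s,a'}
\end{aligned}
```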


Knowing this, we may estimate $\nabla_\theta V^{\pi_\theta}$ and then update
the policy at each iteration. To accomplish this, we utilize Natural Actor
Critic [12]. We can break down the actor critic method into two steps:


• Critic: Estimate $\nabla_\theta V^{\pi_\theta}$ with Linear Function
Approximation (Algorithm 1)


• Actor: Update the policy with the Value-based Natural Policy Gradient
algorithm (LFA-NPG), utilizing Natural Gradient Descent (Algorithm 2):


$$\theta_{k+1} = \theta_k + H(\theta_k) \nabla_\theta V^{\pi_{\theta_k}}(\rho), \quad \theta \in \mathbb{R}^d$$


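Algorithm 2 Sample-based LFA-NPG for Log-linear Policies


Require: Learning rate $\eta$; Stochastic Gradient Descent (SGD) learning
rate $\alpha$; number of SGD iterations $N$
Initialize $\theta^{(0)} = 0$
1: for $t = 0, 1, \ldots, T-1$ do
2:   Initialize $w_0 = 0$
3:   for $n = 0, 1, \ldots, N-1$ do
4:     Call Algorithm 1 to obtain $\hat{Q}(s, a)$ and $s, a \sim d^{(t)}$
5:     Update with SGD:
       $w_{n+1} = \mathrm{Proj}_{\mathcal{W}}\left(w_n - 2\alpha \left(w_n \cdot \phi_{s,a} - \hat{Q}(s, a)\right) \phi_{s,a}\right)$
       where $\mathcal{W} = \{w : \|w\|_2 \le W\}$
6:   end for
7:   Set $\hat{w}^{(t)} = \frac{1}{N} \sum_{n=1}^{N} w_n$
8:   Update $\theta^{(t+1)} = \theta^{(t)} + \eta\, \hat{w}^{(t)}$
9: end for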
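
One outer iteration of Algorithm 2 can be sketched as below. This is a hedged illustration rather than the authors' implementation. The callable `sample_fn` is a hypothetical stand-in for a call to Algorithm 1 followed by a feature lookup, returning a feature vector $\phi_{s,a}$ and an estimate $\hat{Q}(s,a)$. The function names are assumptions.

```python
import numpy as np


def project_to_ball(w, radius):
    """Euclidean projection of w onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def critic_sgd(sample_fn, dim, num_iters, alpha, radius):
    """Average of the SGD iterates w_1..w_N fitting w . phi_{s,a} to Q-hat(s, a)."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(num_iters):
        phi_sa, q_hat = sample_fn()
        # Gradient of the squared error (w . phi - q_hat)^2 with respect to w.
        grad = 2.0 * (w @ phi_sa - q_hat) * phi_sa
        w = project_to_ball(w - alpha * grad, radius)
        w_sum += w
    return w_sum / num_iters


def npg_step(theta, sample_fn, num_iters, alpha, eta, radius):
    """One actor update: theta <- theta + eta * (averaged critic weights)."""
    w_hat = critic_sgd(sample_fn, theta.shape[0], num_iters, alpha, radius)
    return theta + eta * w_hat


# Synthetic check: Q-hat is exactly linear in the features, so the critic
# should recover the generating weights (values below are illustrative only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])


def synthetic_sample():
    phi_sa = rng.normal(size=2)
    return phi_sa, float(phi_sa @ w_true)


print(critic_sgd(synthetic_sample, dim=2, num_iters=5000, alpha=0.05, radius=1e12))
```

The actor step needs no explicit matrix inversion: the least-squares weights returned by the critic are used directly as the update direction for $\theta$.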


Natural Gradient Descent: Natural Gradient Descent is a version of gradient
descent in which each step is preconditioned by the inverse of the Fisher
information matrix. In standard gradient descent, we observe that the gradient
of the value function will be small if the predicted distribution is close to
the true distribution, and large if the predicted distribution is far from the
true distribution. In Natural Gradient Descent, we do not restrict the
movement of the parameters in parameter space, and instead control the
movement of the output distribution at each step. We do this by measuring the
curvature of the log likelihood of the probability distribution, that is, the
Fisher information. This allows us to focus the direction of convergence more
directly towards the optimum, as opposed to standard gradient descent, which
may not converge as directly or as swiftly.
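
For reference, the standard formulation of the natural policy gradient can be written as follows. The symbols $F$, $\dagger$ (pseudoinverse), and $\delta$ are notation introduced for this illustration and are not used elsewhere in this paper. The last line is the second-order approximation that links the Fisher information to the KL divergence between nearby policies.

```latex
\begin{aligned}
F(\theta) &= \mathbb{E}_{s,a}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\,
             \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\right] \\
\theta_{k+1} &= \theta_k + \eta\, F(\theta_k)^{\dagger}\,
             \nabla_\theta V^{\pi_{\theta_k}}(\rho) \\
\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\theta+\delta}\right)
  &\approx \tfrac{1}{2}\, \delta^{\top} F(\theta)\, \delta
\end{aligned}
```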

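We then find our Natural Policy Gradient algorithm:

$$w_k \in \arg\min_{w}\ \mathbb{E}_{s \sim d^{\pi_{\theta_k}},\ a \sim \pi_{\theta_k}(\cdot \mid s)}\left[\left(\hat{Q}^{\pi_{\theta_k}}(s, a) - w^{\top} \phi_{s,a}\right)^2\right]$$

$$\theta_{k+1} = \theta_k + \eta\, w_k$$

where $\mathbb{E}[\hat{Q}^{\pi}(s, a)] = Q^{\pi}(s, a)$, $\theta \in \mathbb{R}^d$,
$w \in \mathbb{R}^d$, and $\phi_{s,a} \in \mathbb{R}^d$ for all $s, a$.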


An intuitive analysis of LFA-NPG: The agent tests its policy in simulation
with termination probability $1 - \gamma$. After either the environment closes
or the algorithm terminates, the final state, action, and cumulative reward
are returned (Algorithm 1). The state, action, and cumulative reward are then
used to run SGD on $w$. If the norm $\|w\|_2$ exceeds some limit $W$, $w$ is
projected back onto the ball of radius $W$, as in the projection step of
Algorithm 2. The average $w$ over $N$ iterations is determined and used to
update $\theta$. This occurs for $T$ iterations, after which point we expect
to have found the optimal $\theta^*$ that maximizes $V^{\pi_\theta}(\rho)$.

We now go on to consider the set of function approximators for our policy,
$\phi$. When we parametrize our policy $\pi$, we describe it in terms of
$\theta$, the vector we optimize, and $\phi$, the set of function
approximators. $\phi$ is defined through $s$ and $a$: we construct $\phi$ as a
block-diagonal matrix with $|A|$ rows and $|A| \cdot \dim(s)$ columns, in
which every nonzero block is the state vector $s$. Intuitively, the
dimensionality of $\phi$ determines the dimensionality of $\theta$, and as a
result controls the performance and speed of LFA-NPG. By modifying the
dimensions of the state vector $s$, adding or removing challenge-specific
parameters (for example, in the CartPole simulation, adding an element to $s$
that holds the sine of the angle between the pole and the cart), we can
further optimize the performance of LFA-NPG. As a guideline, we expect the
optimal $\phi$ to have:


• All elements of $s$ such that they cannot be derived from other elements
within $s$ through simple operations


• All elements of $s$ exerting sufficient influence over $V^{\pi_\theta}(\rho)$


Therefore, an analysis of the environment and challenge is necessary to obtain
the optimal $\phi$ for LFA-NPG. This is one of the fundamental distinctions
between Linear Function Approximations of the value function and Neural
Network based approximations: algorithms such as TRPO and PPO are capable of
determining the optimal $\phi$ function themselves, ensuring that they
maximize their potential. In complex, high dimensional challenges, this
self-selecting system is one of the greatest strengths of Neural Networks.
However, we hypothesize that in lower dimensional systems, one could obtain
the optimal $\phi$ through manual analysis. This places LFA-NPG on equal
footing with TRPO and PPO with regard to $\phi$.
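
The construction of $\phi$ can be sketched as follows, under the block-diagonal reading described above. Both function names are hypothetical. The seven-feature CartPole state mirrors the augmentation this paper uses in its experiments, with the input ordering `[position, velocity, angle, angular velocity]` assumed for illustration.

```python
import numpy as np


def augment_cartpole_state(state):
    """Map [x, x_dot, angle, angle_dot] to a 7-dimensional feature vector."""
    x, x_dot, angle, angle_dot = state
    return np.array([
        x, x_dot,
        np.sin(angle), np.cos(angle),
        angle_dot,
        np.sin(angle_dot), np.cos(angle_dot),
    ])


def block_features(state_features, num_actions):
    """Build phi with shape (num_actions, num_actions * len(state_features)).

    Row a holds the state features in block a and zeros elsewhere, so
    phi[a] @ theta scores action a with its own slice of theta.
    """
    s = np.asarray(state_features, dtype=float)
    return np.kron(np.eye(num_actions), s[None, :])


features = augment_cartpole_state([0.1, 0.0, 0.05, -0.2])
phi = block_features(features, num_actions=2)
print(phi.shape)  # (2, 14): theta therefore has 14 entries
```

Under this layout the length of $\theta$ is the number of actions times the number of state features, which is why adding or removing state features directly changes the size of the optimization problem.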


5 Results and Discussion 


To determine the effectiveness of LFA-NPG in relation to TRPO and PPO,
we test each algorithm across 2 simulated environments (Figure 2):


• CartPole: Consider a system comprising a cart that may move from
side to side, with a joint connecting the cart to one end of a pole. The
objective of our agent is to find the optimal policy to balance the
pole atop the cart by applying a force to the cart, directed either left
or right [13]. A reward of 1 is given for each timestep the pole is kept
upright. The episode terminates when a cumulative reward of 200 is reached,
the pole falls over, or the cart goes out of bounds.


• Acrobot: Consider a system featuring 2 links and 2 joints. Our objective is
to swing the end of the lower link above a given height as fast as possible by
applying torque (clockwise, counterclockwise, or none) on the top link [14,
15]. A reward of -1 is given for each timestep where the bottom link is below
our target height, and the episode terminates after either 500 timesteps
or after the bottom link reaches the goal height.


(a) CartPole-v0 (b) Acrobot-v1


Figure 2: Visualization of CartPole and Acrobot


The first step of testing begins with determining the optimal
hyperparameters. The set of hyperparameters for LFA-NPG includes the NPG
weight $\eta$, SGD step size $\alpha$, number of actor iterations $T$, number
of critic iterations $N$, and the set of function approximators $\phi$. To
determine the optimal set of function approximators to maximize
$V^{\pi_\theta}(\rho)$, a thorough analysis of the problem is required. We
begin with our analysis of the CartPole problem. Our state
$S \in \mathbb{R}^4$ contains the position and velocity of the cart, along
with the angle and angular velocity of the pole. We may find it fit to alter
the dimensionality of $S$ such that $S \in \mathbb{R}^7$, where $S$ now
includes the position and velocity of the cart, sine and cosine of the angle
of the pole, angular velocity of the pole, and sine and cosine of the angular
velocity of the pole. We predict that this $\phi$ should offer a better
result, as more features are learned each iteration and we can observe a more
complex relationship. We proceed to test LFA-NPG on CartPole and compare
the reward per policy iteration to validate this assumption (Figure 3). For
all future results of LFA-NPG in CartPole, we use $S \in \mathbb{R}^7$. We
find our other hyperparameters as $T = 25$; $N = 150$; $\eta = 0.1$;
$\alpha = 0.1$; $W = 10^{12}$; $\gamma = 0.95$.¹

Figure 3: Optimizing the set of function approximators $\phi$ for LFA-NPG

Analysis of the Acrobot problem: Our initial state contains the sine and
cosine of the two joint angles and the angular velocities of each link. As
$S \in \mathbb{R}^6$, we have a sufficiently high dimensionality already.
Therefore, we only add the sine of the angular velocity between the two joints
to $S$, such that $S \in \mathbb{R}^7$. Our rationale is that in scenarios
where the angle between the two joints is $\approx \pi$, we should observe a
greater reward. We may characterize Acrobot as a sparse reward problem: the
agent receives a constant minimum reward ($-500$) until it explores and finds
a greater reward. In Acrobot, this occurs when our agent has a policy that
returns reward $R > -500$. Once more, we test LFA-NPG on Acrobot and compare
the performance for 6 and 7 dimensional $\phi$ (Figure 3). While the rewards
for both $\phi$s are comparable after a greater number of policy iterations,
we note that $S \in \mathbb{R}^7$ reaches these rewards after half the number
of iterations. For all future results of LFA-NPG in Acrobot, we use
$S \in \mathbb{R}^7$. We find our other hyperparameters as $T = 60$;
$N = 80$; $\eta = 1$; $\alpha = 0.0001$; $W = 10^{12}$; $\gamma = 0.95$.¹

We now proceed to compare the performance of LFA-NPG to that of the flagship
Neural Network based algorithms, TRPO and PPO² (Figure 4).

Figure 4: Comparing LFA-NPG with Flagship Neural Networks


Having drawn comparisons between LFA-NPG, TRPO, and PPO, we now
turn our attention to the task of analyzing the robustness of each algorithm.
Robustness analysis involves sampling a state with some randomly sampled
deviation from the true state, and observing how the performance of the
algorithm is affected by this deviation. Mathematically, we modify the
returned state such that each element $S^i \to \tilde{S}^i$, with
$\tilde{S}^i = S^i \times z$ and $z$ drawn from the interval
$[1 - \zeta, 1 + \zeta]$. We may then compare the effect of this noise level
$\zeta$ on the convergence and performance of the algorithms (Figure 5).
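
The perturbation can be sketched as a small wrapper applied to each observed state. The sketch assumes that $z$ is drawn uniformly and independently for each element, and that the noise level is expressed as a fraction (so 0.1 means 10%). Neither detail is specified above, and the function name is hypothetical.

```python
import numpy as np


def perturb_state(state, noise_level, rng):
    """Scale each state element by an independent factor in [1 - noise, 1 + noise].

    Assumes a uniform distribution for the factor, which is an assumption of
    this sketch rather than a detail confirmed by the paper.
    """
    state = np.asarray(state, dtype=float)
    z = rng.uniform(1.0 - noise_level, 1.0 + noise_level, size=state.shape)
    return state * z


rng = np.random.default_rng(0)
true_state = np.array([0.5, -1.0, 0.02, 0.3])
print(perturb_state(true_state, 0.1, rng))  # each element off by at most 10%
```

Because the noise is multiplicative, elements close to zero are barely disturbed while large-magnitude elements move the most.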


¹ Hyperparameters were experimentally determined.
² All tests were conducted on the same device: Intel i5-8250U CPU @ 1.60 GHz,
8 GB RAM, Intel UHD Graphics 620.

[Figure 5 panels recoverable from the source: (c) CartPole: PPO Reward vs
Policy Iterations; (d) Acrobot: LFA-NPG Reward vs Policy Iterations;
(e) Acrobot: TRPO Reward vs Policy Iterations]

Figure 5: LFA-NPG, TRPO, and PPO Robustness Analysis


6 Conclusion


In this work, we compare our LFA-NPG Natural Actor Critic algorithm with
the 2 flagship RL algorithms, TRPO and PPO. Intuitively, we expected
that with lower dimensional discrete challenges such as CartPole and Acrobot,
LFA-NPG should attain a similar result to TRPO and PPO, and we also
hypothesized that LFA-NPG should reach each reward in a shorter duration of
time. In the CartPole simulation, we find this hypothesis to be wholly
supported: we demonstrate that LFA-NPG outperforms the leading RL
algorithms, and achieves each reward at least 20% faster than either
TRPO or PPO. This makes intuitive sense, as the natural gradient architecture
of LFA-NPG is significantly less computationally expensive than a complex
Neural Network architecture, and therefore can be executed much more swiftly.

When testing these same algorithms on the sparse reward environment of
Acrobot, we observe that LFA-NPG noticeably outperforms both PPO and TRPO,
and that LFA-NPG and PPO achieve a similar speed, which is significantly
faster than that of TRPO. We believe that LFA-NPG is able to drastically
outperform these more complex architectures due to the nature of a sparse
reward challenge: specifically, Natural Gradient Descent will take large steps
if it has not yet found the optimal result. This enables it to find a working
policy faster than TRPO or PPO, whose updates are limited by a KL-divergence
constraint (TRPO) or by clipping (PPO). We also observe a distinction in
performance between TRPO and PPO, where PPO is able to attain a better
result. We attribute this to the nature of PPO's multi-batch updates, which
allow it to ensure that it improves its reward after each successive step,
enabling it to improve its performance with swift updates. The difference in
performance between PPO and TRPO may also stem from TRPO's conservative
updates, which may be too cautious to converge swiftly.

We also draw conclusions on the robustness of each algorithm. In CartPole,
we observe that LFA-NPG is wholly resistant to noise for $\zeta < 10$, and at
higher noise levels exhibits more variance and gains less reward. TRPO and PPO
both have similar levels of noise resistance, and gain a similar reward with
larger levels of noise. In the Acrobot simulation, we observe that LFA-NPG is
very noise resistant, exhibiting effectively the same performance across a
wide spread of $\zeta$. TRPO has an initial drop in performance once
$\zeta \neq 0$, and PPO has a very gradual decline in performance as $\zeta$
increases. With this in mind, we may summarize the results of our research as
follows:


• In standard low dimensional challenges, Natural Policy Gradient Methods
such as LFA-NPG reach optimal rewards in noticeably less time than state
of the art Complex Neural Networks, while maintaining a similar rate of
convergence


• In sparse reward low dimensional challenges, Natural Policy Gradient
Methods converge significantly faster than state of the art Complex Neural
Networks, while achieving the same or greater speed


This allows us to conclude that in a challenge with sufficiently small state
and action space, Natural Policy Gradients are a better choice than the
leading neural networks in RL, validating our initial assumption. Combined
with their comparative simplicity of implementation, this demonstrates that
algorithms such as LFA-NPG are very effective for a multitude of RL use cases,
and often surpass their more complex counterparts. The potential applications
of LFA-NPG range from classic control applications from RL literature to
backend applications within more complex systems (such as optimizing a
function to control the velocity of a drone). In the future, we hope to
formulate a version of LFA-NPG that can find the optimal policy for a
continuous action space. This necessitates the formulation of a $\phi$
function that includes the action within the matrix. We also hope to formulate
a version of LFA-NPG that can self-select the optimal $\phi$ through some
linear method.